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Abstract 

Background: RNAseq technology is replacing microarray technology as the tool of choice for gene expression 
profiling. While providing much richer data than microarray, analysis of RNAseq data has been much more 
challenging. To date, there has not been a consensus on the best approach for conducting robust RNAseq analysis. 

Results: In this study, we designed a thorough experiment to evaluate six read count-based RNAseq analysis 
methods (DESeq, DEGseq, edgeR, NBPSeq, TSPM and baySeq) using both real and simulated data. We found the 
six methods produce similar fold changes and reasonable overlapping of differentially expressed genes based on 
p-values. However, all six methods suffer from over-sensitivity. 

Conclusions: Based on the evaluation of runtime using real data and area under the receiver operating 
characteristic curve (AUC-ROC) using simulated data, we found that edgeR achieves a better balance between 
speed and accuracy than the other methods. 
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Background 

The process by which information from a gene is used in 
the synthesis of a functional gene product is called gene 
expression. The process of gene expression is used by all 
known life including eukaryotes, prokaryotes, and viruses 
to generate the macromolecular machinery for life. Gene 
expression analysis is essential for biomedical research. 
The introduction of microarray technology has helped 
biomedical research make significant advances in the last 
decade by allowing high-throughput gene expression 
screening on all known genes. Recently, the introduction 
of RNAseq technology has had a revolutionary impact on 
the field of expression research. RNAseq refers to the use 
of next-generation sequencing (NGS) technologies to 
sequence cDNA in order to get information about a sam- 
ple's RNA content. Compared to microarray technology, 
the RNAseq method offers several distinct advantages. 
First, the detection range of RNAseq is not limited to a 
set of predetermined probes as with microarray technol- 
ogy, so RNAseq is capable of identifying new genes. 
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Second, the resolution of a microarray is limited to the 
gene level for most arrays and the exon level for specially 
designed exon arrays. On the other hand, RNAseq can 
detect expression at the gene, exon, transcript, and cod- 
ing DNA sequence (CDS) levels. Finally and most impor- 
tantly, RNAseq can detect structural variants such as 
alternative splicing and gene fusion. With the maturity of 
NGS technologies, the price of RNAseq has become 
comparable to microarrays. Many researchers have pre- 
dicted the inevitable replacement of microarray by RNA- 
seq [1-3] based on the competitive price and additional 
genomic information contained in RNAseq data. 

With the rich genomic information RNAseq technology 
brings, it also comes with complication in the analysis phase. 
To date, the research community has not yet come to a con- 
sensus on the best approach for analyzing RNAseq data. The 
most popular normalization method for microarray data is 
Robust Multi- array Average (RMA) [4], a form of quantile 
normalization. For RNAseq, one of the popular normaliza- 
tion methods is Reads per Kilobase per Million mapped 
reads (RPKM) [5] or Fragments Per Kilobase of transcript 
per Million mapped reads (FPKM) [6]. The RPKM of a gene 
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is computed as follows: RPKM = 10 9 x , where C is 

N*L 

the number of reads mapped to the gene, N is the total 
number of reads mapped to all genes, and L is the length of 
the gene. FPKM is computed similarly to RPKM, except it 
accounts for the scenario in which only 1 end of a paired- 
end read is mapped. In addition to RPKM and FPKM, other 
read count methods based on Possion, negative binomial, 
and Bayesian methods also exist. Each method has unique 
strengths and weaknesses. In this study, we focus on read 
count-based methods and systematically evaluate 6 RNAseq 
R packages including DESeq [7], DEGseq [8], edgeR [9], bay- 
Seq [10], TSPM [11] and NBPSeq [12] using both real and 
simulated data. BaySeq is considered an empirical Bayes 
approach to detect patterns of differential expression, DESeq 
and NBPSeq are based on a negative binomial model, DEG- 
seq and TSPM are based on a Poisson model, and edgeR 
uses empirical Bayes estimation and exact tests based on the 
negative binomial. 

Method 

The real RNAseq datasets were selected from The Cancer 
Genome Atlas (TCGA) [13,14]. TCGA is a massive, com- 
prehensive, and collaborative project to catalogue genomic 
data for over 20 types of cancers by the National Cancer 
Institute (NCI), the National Human Genome Research 
Institute (NHGRI), and 27 institutes and centers of the 
National Institute of Health (NIH). Gene expression profil- 
ing by RNAseq is one of the major components of geno- 
mic data collected by TCGA. Breast cancer is the only 
cancer type in TCGA that collected expression data on a 
large quantity of tumor-normal paired samples. Thus we 
selected breast cancer tumor-normal paired data (53 pairs) 
as our primary source of real RNAseq data. Differentially 
expressed genes between tumor and normal were identi- 
fied using all six methods at the significance level of 0.05 
with Benjamin-Hochberg False discovery rate (BH FDR) 
adjustment. To evaluate the consistency between the six 
methods, we computed pairwise Spearman's correlations 
as well as intraclass correlation (ICC) for fold change 
values of all genes, along with the corresponding p-values. 
In addition, we evaluated each method's running time. 

For each gene, the count is drawn from the negative 
binomial distribution with the mean and dispersion para- 
meters estimated from the TCGA breast cancer dataset. 
For a given gene m, the fold change is calculated as 
(Pm + P*) d " where Pm is drawn from the gamma distribu- 
tion with shape parameter 0.87 and rate parameter 1.36 
(parameters are estimated from the TCGA breast cancer 
dataset), P* is the lower bound of fold change, and 



1 , if gene m is upregulated 
— 1, ifgenemis downregulated 
0 , otherwise 



We evaluated the methods using datasets simulated to 
present different scenarios corresponding to a given com- 
bination of the following parameters: sample size (5 or 
10), proportion of differentially expressed genes (5% or 
10%), ratio of up-regulated vs. down-regulated (1:1 or 
3:1), lower bound of fold change (1.5 or 1.1), and lower 
bound of depth (5 or 1). A total of the seven most repre- 
sentative scenarios are shown in Table 1. For each sce- 
nario, 30 datasets were simulated from a negative 
binomial distribution. To evaluate the performance of 
the six methods, we calculated the number of genes that 
are significantly differentially expressed (N s ), the rate of 
false positives (FPR), and the rate of true positives {TPR) 
across 30 simulation datasets for each scenario. False dis- 
covery rate (FDR) at levels of 0.1, 0.05 and 0.01 were 
used as the cutoffs. In order to measure the overall per- 
formance, we also computed the area under the curve 
(AUC) across 30 simulation datasets under each scenario. 

Results and discussion 

We performed differential expression analysis on the 53 
tumor-normal paired samples from TCGA using all six 
methods. Out of 16,146 genes, DEGseq identified the 
most number of significant genes at 15,226. That is, 94.3% 
of all genes were identified by DEGseq as statistically 
significantly differentially expressed between tumor and 
normal, which implies that DEGseq is over-sensitive. 
DESeq on the other hand identified the least number of 
differentially expressed genes with 7,171, which is still too 
many to be biologically plausible. NBPSeq, edgeR, TSPM, 
and baySeq identified 10,017, 10,457, 9,519, and 13,203 
differentially expressed genes respectively (Figure 1). In 
terms of up-regulated genes vs. down-regulated genes, 
DESeq, edgeR, and NBPSeq identified more up-regulated 
genes, while TSPM and DEGseq identified more down- 
regulated genes. BaySeq does not provide fold change or 
test statistics, thus no direction of gene expression change 
can be determined through p-value alone. A unique char- 
acteristic associated with baySeq is that there is a randomi- 
zation factor built into its computation model. Using 
baySeq to analyze the same dataset on same parameter 
settings twice will generate slightly different results. In our 
scenario, the difference can be as many as 100 genes. 

Next we computed the overlap of differentially 
expressed genes between methods for both up-regulated 
and down-regulated genes. BaySeq was excluded from 
this analysis due to its lack of directionality (Figure 2). 
For down-regulated genes, 2,521 were identified by the 
rest of five methods, and for up-regulated genes, 4,067 
were identified by all five methods. DESeq, edgeR, and 
NBPSeq observed the fewest singleton genes (defined as 
genes identified by only 1 method) in terms of both up- 
regulated and down-regulated genes. TSPM observed a 
moderate number of singleton genes, and DEGseq 
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Table 1 Scenarios in the simulation comparison study 


Scenario 


Sample size 


DE (%) 


Up/down 


Lower bound of fold change 


Lower bound of depth 




10 


10 


1 




1.5 


5 




2 


10 


1 




1.5 


5 


III 


10 


5 


1 




1.5 


5 


IV 


10 


10 


3 




1.5 


5 


V 


10 


10 


1 




1.1 


5 


VI 


10 


10 


1 




1.5 


1 


VII 


2 


10 


I 




1.1 


1 



observed the most, which is also a reflection of its over- 
sensitivity. 

We studied the relationship of fold change and p-value 
vs. identification frequency of significantly expressed genes 
detected by the six methods (Figure 3). As expected, genes 
identified by all six methods had a stronger signal (e.g., a 
higher fold change value) than genes identified by fewer 
methods. The singleton genes had the lowest fold change 
values (Figure 3 left). However, for p-value the pattern was 
not clear. No significant association between p-value and 
identification frequency was observed. 

We also tested the consistency among the six methods 
at overall significance levels. Pair-wise Spearman's corre- 
lation coefficient of fold change values was computed 
(Figure 4). Most correlation coefficients are very close to 
1, which indicates that the six methods are highly con- 
sistent in terms of ranking genes according to fold 
change despite difference in normalization method. We 
also computed intraclass correlation coefficients (ICC) 
using both fold change and p-value as well. The ICC 
based on fold change was 0.975 which strongly suggests 
that a high degree of agreement among the five methods 
(excluding baySeq because no fold change can be calcu- 
lated) as the fold change value is only affected by differ- 
ent normalization techniques used by different methods 
(Figure 5) However, the ICC for p-value was only 0.50, 






down 


□ 


up 



DESeq edgeR NBPSeq TSPM DEGseq baySeq 

Figure 1 Number of significantly differentially expressed up- 
regulated and down-regulated genes for each method 



indicating that these methods have a low agreement in 
terms of ranking the differentially expressed genes. In 
addition, baySeq produced the worst p-value correlation 
with regard to the other methods. To assess accuracy, 
one has to know a priori which genes are differentially 
expressed between normal and tumor. Hence, we can 
only calculate it in a simulation. 

For simulated data, we compared the number of genes 
that are significantly differentially expressed. In contrast to 
the results from real data, baySeq identified the smallest 
number of differentially expressed genes, while DEGseq 
identified largest number of differentially expressed genes, 
significantly more than other methods (Table 2). DESeq, 
edgeR and NBPSeq identified similar numbers of differen- 
tially expressed genes. The similar performance of DESeq, 
edgeR, and NBPseq is probably due to the fact that these 
methods are based on similar principles except the estima- 
tion of the dispersion parameter. Among all six methods, 
baySeq has the smallest FPR, and DEGseq has the largest 
FPR. This indicates that baySeq was more conservative 
compared to other methods. Among DESeq, edgeR and 
NBPSeq, the largest FPR was found by NBPSeq, followed 
by DESeq or edgeR. In general, the FPR of DESeq was 
smaller than the FPR of edgeR. For TPR, the largest TPR 
was found by DEGseq (i.e. it also found too many false 
positives), followed by TSPM. As expected, DESeq, edgeR 
and NBPSeq obtained similar results in TPR. After com- 
paring the results under scenarios I and II, we found that 
sample size has a large effect, especially for TPR. As sam- 
ple size decreases, N s decreases, FPR increases, and TPR 
decreases, respectively. TSPM shows the strongest effects 
on sample size among the methods. After comparing the 
results under scenarios I and III, we found that the pro- 
portion of differentially expressed genes has a relatively 
small effect on FPR and TPR for all methods. As the pro- 
portion decreases, all of N s , FPR, and TPR decrease. After 
comparing the results under scenarios I and IV, we found 
that the proportions of up-regulated and down-regulated 
genes have a large effect on N s and FPR for TSPM and 
NBPSeq. After comparing the results under scenarios I, V, 
and VI, we found that the level of treatment effects and 
depth of reads have small effects for all of the methods 
under reasonable sample size. Comparing the results 
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Figure 2 Venn diagram of the overlap in differentially up-regulated and down-regulated genes among five methods (excluding baySeq) 



under scenarios I and VII, all methods are sensitive to the 
sample size, level of treatment effects, and depth. 

In Table 3, we summarized the average AUC-ROC for 
each scenario. For each scenario, DEGseq has less satisfy- 
ing performance compared to the other methods. Under 
scenarios II, V, and VII, the average AUC-ROC of baySeq 
was the largest. Those three scenarios were simulated for 
the dataset with small sample size or small treatment 
effects. This indicates that baySeq is the best performing 
method when the dataset has a small sample size or small 
treatment effect. Under other scenarios (i.e. I, III, IV, VI), 
the best performance was generally obtained by edgeR and 
DESeq, followed by NBPSeq and TSPM. 

We also measured the runtime for each method run- 
ning on the TCGA breast cancer dataset. On a Dell 3500 



with a 2.8GHz CPU and 12 GB RAM, DESeq took the 
most time to finish with over 2 hours followed by 
NBPSeq (77 minutes), baySeq (47 minutes). TSPM and 
edgeR took significantly less time in comparison with 6 
and 2 minutes, respectively. DEGseq was the speediest of 
the six, finishing in only 12 seconds. Combining informa- 
tion from both runtime and AUC, edgeR performed best 
with relatively short runtime and good AUC-ROC. 

Conclusions 

In this study, we systematically evaluated six read count- 
based RNAseq analysis methods using both real and 
simulated data. BaySeq is the most unique method out of 
the six, because it is based on an empirical Bayesian 
approach which produces no fold change or test statistics 
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Figure 3 The relationship of fold change and p-value vs. identification frequency. 
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Figure 4 Pair-wise Spearman's correlation coefficients of fold change computed among six methods. 



to infer the direction of the gene expression difference. 
This inconvenient feature of baySeq forces users to go 
back to the raw expression value to determine the direc- 
tion. Another unique characteristic of baySeq is the ran- 
domization steps involved in its model. For the same 
data with the same parameter settings, baySeq produces 
slightly different results. Granted, genes with the stron- 
gest fold change will always be detected by baySeq, but 
this minor inconsistency between runs of baySeq does 
produce some inconvenience. From our simulation study, 
we found that the performances of DESeq, edgeR, and 
NBPSeq are similar in most cases. This result is expected 
because all three methods are based on a negative bino- 
mial model and differ principally in the way the disper- 
sion parameter is estimated. 

One of the issues we observed with these six read 
count-based RNAseq analysis methods is that they tend 
to be over-sensitive. For a traditional microarray study, 
the number of differentially expressed genes identified 
by a simple method such as a t-test is highly dependent 
on the sample size. We have tested these six methods 
using smaller sample sizes such as 2 vs. 2, but the 
majority of the methods still produced huge amounts of 
differentially expressed genes (sometimes over 50% of 



total genes), which is a clear sign of over-sensitivity. 
Through more thorough analysis, we found that the five 
methods (excluding baySeq) produced very similar fold 
changes but less similar ranks of p-value. Given these 
facts, we would recommend using the combined criteria 
of fold change and p-value to filter out more false posi- 
tives. Combined with evaluation of runtime from real 
data and AUC-ROC from simulated data, edgeR 
achieved a good balance between speed and accuracy. 

We are far from reaching a consensus on the best 
RNAseq analysis approach. Several unique characteris- 
tics of RNAseq data contribute to the difficulty of RNA- 
seq data analysis. The value for non-expression in 
RNAseq is zero, which means there are no reads aligned 
to the gene. In microarray, there is always background 
intensity for non-expressed genes, thus we can always 
take log transformations of the intensity data. In con- 
trast, due to the large number of zeros in RNAseq data 
(often more than 50%), we cannot take a log transfor- 
mation. The range of RNAseq data is between 0 and 
50,000, compared to microarray's 2 to 15 after RMA 
normalization. This huge range of RNAseq data can 
result in many false high fold changes. Also, there are 
many sequencing and alignment artifacts that can skew 
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Table 2 Performance comparison under different scenarios 







FDR = 0.1 






FDR = 0.05 






FDR = 0.01 




N s 


FPR 


TPR 


N s 


FPR 


TPR 


N s 


FPR 


TPR 


DESeq 


759 


0.0090 


0.6779 


679 


0.0054 


0.6304 


550 


0.0020 


0.5316 


Edger 


825 


0.0115 


0.7216 


740 


0.0068 


0.6799 


614 


0.0023 


0.5934 


DEGseq 


8814 


0.8735 


0.9532 


8864 


0.8573 


0.8573 


8415 


0.8306 


0.9402 


NBPSeq 


1167 


0.0480 


0.7346 


972 


0.0322 


0.6828 


723 


0.0153 


0.5850 


Bayseq 


729 


0.0060 


0.6750 


654 


0.0027 


0.6296 


543 


0.0006 


0.5384 


TSPM 


2307 


0.1687 


0.7893 


1815 


0.0120 


0.7306 


1202 


0.0663 


0.6050 


II DESeq 


289 


0.0153 


0.1515 


175 


0.0096 


0.0890 


59 


0.0037 


0.0255 


Edger 


239 


0.0138 


0.1139 


130 


0.0080 


0.0580 


38 


0.0028 


0.0127 


DEGseq 


8379 


0.8292 


0.9162 


8215 


0.8119 


0.9081 


7921 


0.7801 


0.8923 


NBPSeq 


308 


0.0207 


0.1219 


194 


0.0134 


0.0729 


70 


0.0054 


0.0213 


Bayseq 


189 


0.0070 


0.1260 


88 


0.0029 


0.0617 


10 


0.0003 


0.0067 


TSPM 


291 


0.0262 


0.0557 


262 


0.0234 


0.0513 


217 


0.0191 


0.0447 


III DESeq 


363 


0.0056 


0.6200 


319 


0.0034 


0.5739 


255 


0.0013 


0.4860 


Edger 


401 


0.0068 


0.6729 


354 


0.0041 


0.6301 


284 


0.0014 


0.5414 


DEGseq 


8451 


0.8391 


0.9598 


8288 


0.8222 


0.9555 


8002 


0.7925 


0.9462 


NBPSeq 


567 


0.0241 


0.6767 


474 


0.0169 


0.6283 


346 


0.0084 


0.5335 


Bayseq 


332 


0.0031 


0.6052 


295 


0.0015 


0.5615 


242 


0.0004 


0.4763 


TSPM 


851 


0.0529 


0.6892 


651 


0.0354 


0.6311 


426 


0.018 


0.5035 


IV DESeq 


758 


0.0090 


0.6770 


678 


0.0055 


0.6290 


544 


0.0019 


0.5269 
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Table 2 Performance comparison under different scenarios (Continued) 
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TSPM 
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DESeq 
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0.0080 


0.0725 


53 
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0.0220 


Edger 
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DEGseq 
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0.9091 


7949 


0.7840 


0.8929 


NBPSeq 


253 


0.0170 


0.0991 


153 


0.0108 


0.0562 


59 


0.0047 


0.0165 


Bayseq 


156 


0.0054 


0.1076 


70 


0.0021 


0.0508 


7 


0.0002 


0.0047 


TSPM 


296 


0.0024 


0.0571 


266 


0.0236 


0.0537 


219 


0.0191 


0.0476 



Table 3 Average area under ROC curve (AUC-ROC) under different scenarios 



AUC 





1 


II 


III 


IV 


V 


VI 


VII 


Methods DESeq 


0.9284 


0.7717 


0.9097 


0.9281 


0.8921 


0.9326 


0.7427 


Edger 


0.9322 


0.7608 


0.9213 


0.9323 


0.9001 


0.9393 


0.7336 


DEGseq 


0.6081 


0.6046 


0.6413 


0.5722 


0.6031 


0.6131 


0.6000 


NBPSeq 


0.9129 


0.7559 


0.9209 


0.8496 


0.8828 


0.9206 


0.7330 


Bayseq 


0.9280 


0.7843 


0.9013 


0.9240 


0.9013 


0.9297 


0.7660 


TSPM 


0.8744 


0.7341 


0.9109 


0.7688 


0.8505 


0.8870 


0.7176 



RNAseq data. For example, GC content plays an impor- 
tant factor in the sequenceability of a gene, and the 
exon's length has a large effect on alignment accuracy. 
To date, only 1 RNAseq package, DESeq takes the 
paired data information into consideration. In summary, 
even though there are a large number of RNAseq analy- 
sis tools at our disposal, given the uniqueness of RNA- 
seq data and the number of unsolved problems, there is 
still much room left to improve RNAseq analysis. 
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